Abstract

Background: The significant growth in the volume of electronic biomedical data in recent decades has pointed to 
the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, 
duplicate detection, terminology integration, and spelling correction. The task of source integration in the Unified 
Medical Language System (UMLS) requires considerable expert effort despite the presence of various computational 
tools. This problem warrants the search for a new method for approximate string matching and its UMLS-based 
evaluation. 

Results: This paper introduces the Longest Approximately Common Prefix (LACP) method as an algorithm for 
approximate string matching that runs in linear time. We compare the LACP method for performance, precision 
and speed to nine other well-known string matching algorithms. As test data, we use two multiple-source samples 
from the Unified Medical Language System (UMLS) and two SNOMED Clinical Terms-based samples. In addition, we 
present a spell checker based on the LACP method. 

Conclusions: The Longest Approximately Common Prefix method completes its string similarity evaluations in less
time than all nine string similarity methods used for comparison. The Longest Approximately Common Prefix
outperforms these nine approximate string matching methods in its Maximum F1 measure when evaluated on
three out of the four datasets, and in its average precision on two of the four datasets.



Background 

The term-matching problem has been widely addressed in
multiple contexts, resulting in a number of string
similarity metrics that have been designed, applied and
evaluated in various research studies [1]. In the biomedical
domain, scientists use various approximate string matching
(ASM) methods for research tasks such as retrieving
sequences from existing databases that are homologous to
newly discovered ones, and establishing multiple sequence
alignments to discover similarity patterns that help predict
the function, structure, and evolutionary history of
biological sequences [2].

The recent expansion of healthcare information systems
that draw from multiple medical databases has resulted in
redundant information, among other problems. This
phenomenon, also known as the duplicate detection problem,
has complicated record linkage across medical databases.
Previous research has addressed problems such as patient
record aggregation from multiple databases based on a
minimum profile (i.e., name, gender and date of birth) [3]
and term matching for source integration, spelling
correction and biomedical data mining applications. In this
paper, these tasks are considered in the context of
terminologies such as Systematized Nomenclature of Medicine
Clinical Terms (SNOMED CT) and the Unified Medical Language
System (UMLS) [4]. Approximate String Matching (ASM) methods
are used for augmenting, updating, and auditing UMLS
vocabularies. ASM methods are also important for
facilitating biomedical information extraction, relationship
search, and concept discovery [5].

The UMLS is an extensive terminological knowledge base
comprised of three major components: the Metathesaurus,
the Semantic Network, and the SPECIALIST Lexicon and
Lexical Tools. The current 2013AB release of the
Metathesaurus contains more than 2.9 million concepts and
11.4 million unique terms retrieved from over 160 source
vocabularies [6]. UMLS source integration is a complicated
multistep process and, despite the availability of numerous
algorithmic tools, managing these vocabularies requires
considerable human involvement. As additional sources
are integrated into the UMLS, they will require
reintegration with existing vocabularies [4].

These disadvantages motivate the search for a new 
method for approximate string matching and UMLS- 
based evaluation. In this paper, we introduce the Longest 
Approximately Common Prefix (LACP) method for 
ASM and present the results of its use to improve the 
operation of a number of applications in biomedical in- 
formatics and related domains. 

It bears noting that, in contrast to the well-known 
SPECIALIST lexicon tools Norm, Word Index or LVG 
[7], LACP does not perform text manipulations. Instead, 
it assesses the similarity or dissimilarity of two strings. 

Three other highly regarded tools, MetaMap [8],
NCBO Annotator [9] and ConceptMapper [10], are publicly
available concept recognition systems designed for
annotating text with terms from various ontologies [11].
The general approach of these tools is to split the input
text into smaller units, such as phrases or tokens, which
are subsequently looked up in a dictionary. For instance,
MetaMap splits the input text into phrases and produces
their variants. It then generates a candidate set, which is
mapped to an ontology. The LACP method introduced
in this paper may be used as an inner component of
such a system for calculating the similarity of a candidate
phrase or token to various ontology terms. We consider the
implementation of a text annotation system incorporating
the LACP method a direction for future research.

The rest of the section is dedicated to the analysis of 
the relevant research approaches and the related work 
studying the application of well-known similarity mea- 
sures in the biomedical domain. 

Tan et al. [12] applied the classic Levenshtein score,
combined with a threshold, to medical ontology
alignment. Tolentino et al. [13] used the Levenshtein
technique in combination with other string similarity
algorithms to construct a UMLS-based spell checker. Sahay
et al. [14] combined the Jaro and Jaro-Winkler similarity
metrics with Term Frequency/Inverse Document Frequency
(TFIDF) to compute similarity values between ontological
concepts and phrases. Cohen et al. [15] described,
implemented and evaluated such hybrid distances in
the SecondString Java toolkit.

Plaza et al. [16] applied heuristic rules with a clustering
algorithm to the problem of biomedical text summarization.
Their work mapped terms found in a given document to
UMLS concepts. Using the relationships between the
identified UMLS concepts, the authors then represented
the document as a graph and assigned sentences to clusters
based on semantic similarity. Finally, the most important
sentences were selected for inclusion in a document summary.

Zhen et al. [17] introduced a TFIDF string distance 
method within their clustering algorithm and applied it 
to biomedical ontologies. The evaluation of their method 
demonstrated superior values of the F-measure on two 
datasets derived from the MeSH and GO ontologies. 

In a previous paper, we developed a novel Markov 
Random Field-based Edit Distance (MRFED) and applied 
it to the ASM problem in GO ontologies [18]. Similarly, 
Wellner et al. [19] used Conditional Random Fields in a 
distance metric method on a UMLS Metathesaurus data- 
set. Bodenreider et al. [20] applied the Cosine, Jaccard 
and Dice string similarity coefficients to aligning the 
UMLS Semantic Network with the Metathesaurus. 

Yamaguchi et al. [21] tested four similarity metrics for
clustering terms that appear in the UMLS Metathesaurus.
The authors compared the performance of the Monge-Elkan,
SoftTFIDF, Jaro-Winkler and bigram Dice coefficient
methods, evaluating these techniques on chemical and
non-chemical terms grouped into two datasets. Among other
findings, they demonstrated that normalized string distances
performed better than the standard measures in terms of
precision, recall, and F-measure, and that similarity
metrics required different parameters, such as threshold
values, for chemical and non-chemical terms.

Sauleau et al. [22] proposed a method for linking
medical records by examining the connections between
stand-alone and clustered databases. The authors developed
a three-step approach: 1) preprocessing the data and
applying blockers, 2) matching pairs of records using the
Porter-Jaro-Winkler score calculation, and 3) clustering
the data. The authors suggest that their method is useful
for inserting new entities into large databases.

Zunner et al. [23] studied the semi-automated mapping 
of non-English terms to Logical Observation Identifiers 
Names and Codes (LOINC) [24] using the Regenstrief 
LOINC Mapping Assistant (RELMA) [25]. Their approach 
resulted in a mapping rate of 500 terms per day, which 
they considered satisfactory. 

In research by Parcero et al. [26], mapping a local ter- 
minology to the LOINC dataset led to the development 
of an automated tool that uses an approximate string 
matching function. McDonald et al. benchmarked Jaccard, 
Levenshtein, Monge-Elkan, and Soft TFIDF metrics for 
LOINC integration, and the Jaccard method was selected 
as the best choice for such a task [24]. 

The present research employs the Shortest Path Edit
Distance (SPED) algorithm we developed previously [27]
to compute a string distance based on substring matching
and graph-based transformations. To adjust the
dissimilarity values in the final results, we applied a
re-scorer set according to the length of equal string
prefixes. This final step produced a major improvement in
results and inspired this paper on the Longest
Approximately Common Prefix (LACP) method, a novel string
similarity metric based on the approximate prefix match of
two strings. This paper demonstrates how this fast string
distance method provides performance that is superior to
other methods on datasets from SNOMED CT and from multiple
UMLS sources (Table 1) in terms of average precision and
Maximum F1.

Methods 

The Longest Approximately Common Prefix (LACP) 
method is based on an approximate histogram match of 
string prefixes. It identifies matches by determining the 
similarity value of a pair of strings. The method com- 
pares the histogram differences between the prefixes of 
two strings to parameter a. It begins its search in the 
first characters of the strings. The prefix length is 
returned when the histogram difference is equal to a or 
the last character of the shorter string is reached. The 
prefix length is then divided by the average length of the 
pair of strings. The division takes into consideration 
string lengths, since strings that have significantly vary- 
ing lengths are more dissimilar than strings that do not. 
The division also assures that the value of the LACP 
function stays in the [0, 1] interval. The formula for the 
LACP function (1) is as follows: 



$$\mathrm{LACP}(S, T) = 1 - \frac{\mathrm{prefLength}(S, T)}{(|S| + |T|)/2} \tag{1}$$



where prefLength is the length of the longest approximately 
common prefix. According to formula (1), for two identical 
strings, LACP is 0, whereas LACP is 1 for two strings not 
sharing any common prefix under a certain selection of 
the parameter a. The formula for prefLength is given in (2) 
below: 



$$\mathrm{prefLength} = \{\, i \mid (\mathrm{prefHistDiff}(S_{1,i+1}, T_{1,i+1}) = a) \wedge (\mathrm{prefHistDiff}(S_{1,i}, T_{1,i}) < a) \,\} \tag{2}$$



where prefHistDiff is a histogram difference function of
string prefixes, a is a parameter, and $S_{1,i}$ and
$T_{1,i}$ are the prefixes of strings S and T of length i.
For example, for the strings S = Anorexia and T = Angina,
with a = 2, the prefLength would be 3, because the two
initial characters match and a allows only one mismatch.
Alternatively, with a = 3 the prefLength would be 4 because
two mismatches are allowed.

The histogram difference function for string prefixes is 
defined in formula (3): 

$$\mathrm{prefHistDiff}(S_{1,i}, T_{1,i}) = i - |\mathrm{hist}(S_{1,i}) \cap \mathrm{hist}(T_{1,i})| \tag{3}$$

where hist is a histogram, and i satisfies the inequality (4):

$$1 \le i \le \min(|S|, |T|) \tag{4}$$

A histogram is an array that counts the number of
occurrences of each distinct symbol in a string. In formulae
(2) and (3), i denotes a prefix length. Subtracting from i
the number of characters that are common to the histograms
of both prefixes leaves the number of non-common
characters. This number of non-common characters is
compared with the parameter a, as shown in formula (2).
During the evaluation phase, we used a = 3, which allowed
two mismatches in the histogram difference.

The expression $\mathrm{hist}(S_{1,i}) \cap \mathrm{hist}(T_{1,i})$
denotes the histogram intersection of two string prefixes.
Figure 1 depicts the histogram intersection of two UMLS
terms, ammonium and ammonium ion. The histogram of ammonium
is in Figure 1a, the histogram of ammonium ion is in
Figure 1b. The intersection (Figure 1c) is computed as
the minimum for each pair of argument values of the
same character, with missing values in one argument
omitted from the result.

For example, ammonium contains one "o" while there
are two letters "o" in ammonium ion. As min(1, 2) = 1,
the resulting histogram in Figure 1c contains the entry
"1" for the letter "o". As there is no blank in ammonium,
there is also no entry for the blank character in the
resulting histogram. In order to compute the size (the
"absolute value" ||) of the histogram intersection in
Figure 1c, the sum of all the numbers in the result matrix
is calculated. For Figure 1c, the size of the histogram
intersection is (1 + 1 + 3 + 1 + 1 + 1) = 8.
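The ammonium example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `histogram_intersection_size` is hypothetical, and `collections.Counter` stands in for the histogram array described above.

```python
from collections import Counter


def histogram_intersection_size(s: str, t: str) -> int:
    """Sum of the per-character minimum counts of two strings."""
    hist_s, hist_t = Counter(s), Counter(t)
    # Counter & Counter keeps, for every character present in both
    # histograms, the smaller of the two counts.
    return sum((hist_s & hist_t).values())


print(histogram_intersection_size("ammonium", "ammonium ion"))  # 8
```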

An example of three strings sharing the same prefix is 
shown in Table 2. Strings (1) and (2) comprise the first 
pair, and strings (1) and (3) form the second pair. Clearly, 



Table 1 Four medical informatics datasets used in experiments

#   Dataset                                                     # of concepts   # of terms   Size in kilobytes
D1  The UMLS most frequent concepts from multiple sources       100             4,979        369
D2  The SNOMED CT most frequent concepts                        155             5,000        281
D3  The UMLS concepts with longest terms ("longest concepts")   3,337           5,000        1,693
D4  The SNOMED CT longest concepts                              1,805           5,000        903

Figure 1 Example of histogram intersection. The expression $\mathrm{hist}(S_{1,i}) \cap \mathrm{hist}(T_{1,i})$ denotes the histogram intersection of two string prefixes.
The figure depicts the histogram intersection of two UMLS terms, ammonium and ammonium ion. The histogram of ammonium is in a, the histogram of
ammonium ion is in b. The intersection (c) is computed as the minimum for each pair of argument values of the same character, with missing
values in one argument omitted from the result. For example, ammonium contains one "o" while there are two letters "o" in ammonium ion. As
min(1, 2) = 1, the resulting histogram in c contains the entry "1" for the letter "o". As there is no blank in ammonium, there is also no entry for
the blank character in the resulting histogram. In order to compute the size (the "absolute value" ||) of the histogram intersection in c, the sum of
all the numbers in the result matrix is calculated. For c, the size of the histogram intersection is (1 + 1 + 3 + 1 + 1 + 1) = 8.



the first pair of strings is more similar than the second 
pair. To account for this and similar cases, the length of 
the approximately common prefix is divided by the aver- 
age string length in formula (1). In Table 2, strings (1) and 
(2) belong to the UMLS concept with Concept Unique 
Identifier (CUI) C0002611, while string (3) is associated 
with (CUI) C1816069. 
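
As a worked illustration of this normalization, assume that both pairs share the full 8-character prefix of string (1), so that prefLength = 8 in each case (an assumption that ignores letter case). With the lengths listed in Table 2, formula (1) then gives:

$$\mathrm{LACP}(1, 2) = 1 - \frac{8}{(8 + 12)/2} = 1 - \frac{8}{10} = 0.2$$

$$\mathrm{LACP}(1, 3) = 1 - \frac{8}{(8 + 369)/2} = 1 - \frac{8}{188.5} \approx 0.958$$

Under this assumption the first pair receives a much smaller distance than the second, even though the common prefix has the same length.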

The LACP algorithm is shown in Table 3. The algorithm begins
by setting the histogram intersection to 0. The search for
the longest approximately common prefix begins with the
first character of each string. In steps 3 and 4, the
characters at the current position i of strings S and T are
added to the corresponding histograms. In steps 5 through 9,
all characters in the histogram of string S are compared
against the histogram of string T at the current iteration i.
At this point, the search has advanced to the i-th character
of each string. Steps 6 and 7 describe the following: when

Table 2 UMLS terms sharing the same longest approximately common prefix

#  String  Length
1  Ammonium  8
2  Ammonium ion  12
3  AMMONIUM-CHLORIDE 1 MG/CYANOCOBALAMIN 5 MCG/FERRIC AMMONIUM CITRATE 40 MG/FOLIC ACID 1 MG/LYSINE HYDROCHLORIDE 100 MG/MAGNESIUM SULFATE 1 MG/MANGANESE SULFATE ANHYDROUS 1 MG/NIACIN 5 MG/PANTHENOL 1 MG/POTASSIUM SULFATE 1 MG/PYRIDOXINE HYDROCHLORIDE 0.5 MG/RIBOFLAVIN 1.2 MG/THIAMINE HYDROCHLORIDE 12 MG/ZINC SULFATE 1 MG ORAL LIQUID [HEMERGON]  369



a character c is found in both histograms, operation Get(c) 
retrieves the count of this character from both HistS and 
HistT. Then the smaller of the two values is added to the 
intersection. The search continues until the parameter a 
is reached, as shown in line 9, or the last character of the 
shorter string is processed, as specified in line 2. In the lat- 
ter case, the length of the shorter string is computed in 
line 11. 

Table 3 Algorithm of the LACP method

No  Line                                                                     Complexity
1   Intersection = 0                                                         O(1)
2   FOR i = 1 to min(|S|, |T|)                                               O(n)
    BEGIN
3     HistS.add(S_i)                                                         O(1)
4     HistT.add(T_i)                                                         O(1)
5     FOR (Char c : HistT.KeySet())                                          Constant
      BEGIN
6       IF HistS.ContainsKey(c)                                              O(1)
7       THEN Intersection = Intersection + min(HistS.Get(c), HistT.Get(c))   O(1)
8     END
9     IF (i - Intersection) = a
      THEN RETURN 1 - prefLength / ((S.length() + T.length())/2)             O(1)
10  END
11  RETURN 1 - min(|S|, |T|) / ((S.length() + T.length())/2)                 O(1)
    Total complexity                                                         O(n)
In addition to its linear computational complexity, the
simplicity of the LACP algorithm ensures a short execution
time. Big-O computational complexity is commonly used in
computer science for estimating the speed of an algorithm.
The calculation of the time complexity of the LACP method
is shown in Table 3. The inner loop in step 5 is bounded by
the number of printable characters and is therefore
constant [28]. Thus, the complexity of the LACP algorithm
is linear, i.e., O(n), which is fast compared with the
other algorithms evaluated in this paper.
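
The method described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation, and it rests on two assumptions: prefLength is taken to be the longest prefix whose histogram difference is still below a, which reproduces the Anorexia/Angina example given with formula (2), and the histogram intersection is recomputed from scratch at every prefix length. The names `pref_length` and `lacp` are hypothetical.

```python
from collections import Counter


def pref_length(s: str, t: str, a: int = 3) -> int:
    """Longest prefix length whose histogram difference stays below a (assumed reading)."""
    hist_s, hist_t = Counter(), Counter()
    shorter = min(len(s), len(t))
    for i in range(1, shorter + 1):
        hist_s[s[i - 1]] += 1
        hist_t[t[i - 1]] += 1
        # Histogram intersection: per-character minimum of the two counts.
        # The number of distinct characters is bounded, so this step is constant.
        intersection = sum((hist_s & hist_t).values())
        # The difference grows by at most one per step, so >= acts like = for a >= 1.
        if i - intersection >= a:
            return i - 1
    return shorter


def lacp(s: str, t: str, a: int = 3) -> float:
    """Formula (1): 0 for identical strings, larger values for less similar ones."""
    average_length = (len(s) + len(t)) / 2
    if average_length == 0:
        return 0.0  # assumption: two empty strings count as identical
    return 1 - pref_length(s, t, a) / average_length


print(pref_length("Anorexia", "Angina", a=2))  # 3
print(pref_length("Anorexia", "Angina", a=3))  # 4
print(lacp("Ammonium", "Ammonium", a=3))       # 0.0
```

Each string position is visited once and the per-position work is bounded by the alphabet size, which matches the linear bound derived in Table 3.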

LACP-based interactive spell checker 

We have employed the LACP method to develop an inter- 
active online spell checker [29] for SNOMED CT terms. 
The spell checker is a program written in PHP, which con- 
nects to a MySQL database containing SNOMED CT 
terms from the 2009AB edition of the UMLS. The goal 
of the application is to evaluate LACP performance by 
revealing the set of SNOMED CT terms that are similar 
to the user-provided input term. 

The spell checker accepts an input query and inter- 
actively outputs the SNOMED CT terms satisfying the 
condition LACP(S, T) < t. Here, S is the input term, T is 
a SNOMED CT term, and t is a threshold. To reduce 
the run time, the algorithm limits the set of search terms 
by applying length criteria as described below. 

Several parameters define the performance of the spell
checker, depending on the mode of operation. The length |T|
of a SNOMED CT term that is considered a potential match is
bounded by formulas (5), (7), and (8), one for each of the
three modes of operation. Parameters A and B are used in (8)
to determine the lower and upper limits for |T|,
respectively. Parameter a sets the upper bound for the
number of allowed character mismatches in the prefixes of
strings S and T. Threshold t defines the "cut-off point"
for the LACP score; a pair of strings S and T is considered
to be a match when the LACP score is less than the
threshold t.

Three modes of operation are implemented: (a) a search
with dynamically estimated parameters; (b) a search with
static parameters; and (c) a search with user-defined
parameters. In case (a), the search is limited to the
database terms meeting the criterion (5), while a is defined
in (6) and threshold t is 0.1.

max (o,\S\-^-3^ <|r|<|5|+^ + 3 (5) 

For example, for string S = Ischemia, |S| = 8. Thus,
according to (5), the dynamic search would be limited to
terms longer than 4 characters and shorter than 12
characters. In case (a), parameter a is set individually for
each pair of strings S and T as shown in (6):

. = « (6) 

In case (b), a is set to 1, threshold t is 0.1, and the 
length of a term should be in the following range (7): 

$$\max(0, |S| - 3) < |T| < |S| + 3 \tag{7}$$

In case (c), a user selects parameter values from prede- 
fined sets. The search is restricted to terms with lengths 
within the interval (8). 

$$\max(0, |S| - A) < |T| < |S| + B \tag{8}$$

Parameters A, B, and a are constrained to integers in 
the interval 1..15, and threshold t must be selected from 
the set {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. 

The dynamic search option adjusts the number of 
allowed misspellings a along with minimum and maximum 
term length parameters according to the input query. The 
dynamic search offers flexibility without user intervention. 
The threshold t is set to 0.1 for this search mode. 

The static search option operates with constant param- 
eter values. It allows only one misspelling. The lengths of 
the returned strings must be in the neighbourhood of ±3 
characters of the input query length. This option de- 
creases the search time for longer input terms compared 
to the dynamic search option. 

The search mode based on user-defined parameters 
expands parameter options within pre-defined ranges. 
This mode is intended for users who are not satisfied 
with the results of the dynamic and static modes or who 
seek more refined results. 

In summary, the dynamic option is suggested when results
vary significantly in length from the search query.
The static search option should be used when the resulting
strings are expected to lie in the neighbourhood of the
input term. The search with user-defined parameters is
intended for fine-tuning results or for a more advanced
search.
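
The length criteria of the static and user-defined modes can be sketched as follows. This is a minimal Python illustration rather than the PHP/MySQL implementation: it assumes the strict inequalities printed in formulas (7) and (8), and the function names and sample terms are hypothetical.

```python
def length_window(query_len: int, below: int = 3, above: int = 3) -> tuple[int, int]:
    """Exclusive bounds on |T| from formula (8); the defaults give formula (7)."""
    return max(0, query_len - below), query_len + above


def candidate_terms(query: str, terms: list[str], below: int = 3, above: int = 3) -> list[str]:
    """Keep only the terms whose length lies strictly inside the window."""
    low, high = length_window(len(query), below, above)
    return [term for term in terms if low < len(term) < high]


# Illustrative strings, not taken from SNOMED CT.
terms = ["Ischemia", "Ischaemia", "Ischemic stroke", "Asthma", "Flu"]
print(candidate_terms("Ischemia", terms))  # ['Ischemia', 'Ischaemia', 'Asthma']
```

Each surviving term T would then be scored against the input S and reported when LACP(S, T) < t.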

Results 

LACP was compared to nine other well-known approximate
string distance metrics: Jaccard [30], Jaro [31],
Jaro-Winkler [32], Levenshtein [33], Monge-Elkan [34],
Needleman-Wunsch [35], Smith-Waterman [36], TFIDF
[37], and Soft TFIDF [15]. The comparison used four
datasets derived from Version 2009AB of the UMLS (Table 1).
Dataset D1 was obtained by counting occurrences of each
Concept Unique Identifier (CUI) within the UMLS [38],
retrieving all terms corresponding to the 100 most frequent
CUIs and eliminating records with duplicate terms. D2 was
created in the same way, but limited to concepts from
SNOMED CT [39]. D3 was built by retrieving the 5,000
longest terms from the multiple UMLS sources. D4 was
constructed by taking the 5,000 longest terms from
SNOMED CT.
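
The construction of D1 can be sketched as below. The input layout is an assumption: the sketch takes an in-memory list of (CUI, term) pairs rather than reading UMLS files, treats only exact string repeats as duplicate terms, and uses the hypothetical name `most_frequent_concept_terms`.

```python
from collections import Counter


def most_frequent_concept_terms(records, top_n=100):
    """Return (cui, term) records for the top_n most frequent CUIs, without repeated terms."""
    counts = Counter(cui for cui, _ in records)
    top_cuis = {cui for cui, _ in counts.most_common(top_n)}
    seen_terms, result = set(), []
    for cui, term in records:
        if cui in top_cuis and term not in seen_terms:
            seen_terms.add(term)
            result.append((cui, term))
    return result


# Tiny made-up sample: C1 occurs most often, so it is the only CUI kept for top_n=1.
sample = [("C1", "a"), ("C1", "b"), ("C2", "a"), ("C3", "c"), ("C1", "a")]
print(most_frequent_concept_terms(sample, top_n=1))  # [('C1', 'a'), ('C1', 'b')]
```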

SecondString [17], an open-source Java toolkit, was
used as an experimental test bed. During the experiments,
each term was matched against the terms it was paired with
in a set of candidate pairs. Using such a set reduces the
problem size and speeds up experiment execution. The
candidate set includes pairs of terms from the dataset that
share one or more common words. The goal was to determine
whether the two terms in each pair have the same CUI.
Using common performance evaluation methods from
information retrieval [27], we calculated average precision
(P), recall (R) and Maximum F1 values (formulae
(9), (10), and (11)), and graphed precision-recall (P-R)
curves for our method and for the competing techniques.
Precision and recall trade off against one another: on the
one hand, it is possible to obtain the maximum value of
recall with a low value of precision by retrieving all
documents for all queries. On the other hand, precision
usually decreases as the number of retrieved documents
grows. A single measure that trades off precision versus
recall is the F measure, which is the weighted harmonic
mean of precision and recall [40].





$$P = \frac{D_r}{D_t} \tag{9}$$

$$R = \frac{D_r}{N_r} \tag{10}$$

$$F_1 = \frac{2PR}{P + R} \tag{11}$$



In (9) and (10), $D_r$ denotes the number of relevant
items retrieved, $D_t$ is the total number of retrieved items,
and $N_r$ is the number of relevant items in the collection.
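
Formulae (9) to (11) translate directly into code. The sketch below also shows one way a Maximum F1 value could be obtained, by ranking candidate pairs by distance and taking the best F1 over all cut-off positions; that ranking procedure and the function names are assumptions for illustration, not a description of the SecondString internals.

```python
def precision_recall_f1(relevant_retrieved: int, retrieved: int, relevant: int):
    """Formulae (9)-(11); zero denominators yield 0.0."""
    p = relevant_retrieved / retrieved if retrieved else 0.0
    r = relevant_retrieved / relevant if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def maximum_f1(scored_pairs):
    """scored_pairs: (distance, is_match) tuples; smaller distance means more similar."""
    ranked = sorted(scored_pairs, key=lambda pair: pair[0])
    total_relevant = sum(1 for _, is_match in ranked if is_match)
    best, hits = 0.0, 0
    for k, (_, is_match) in enumerate(ranked, start=1):
        hits += int(is_match)
        best = max(best, precision_recall_f1(hits, k, total_relevant)[2])
    return best


# Made-up scores: three true matches among four candidate pairs.
pairs = [(0.0, True), (0.1, True), (0.3, False), (0.4, True)]
print(round(maximum_f1(pairs), 3))  # 0.857
```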



Table 4 Average precision P

Method             D1     D2     D3     D4
Jaccard            0.31   0.33   0.22   0.54
Jaro               0.26   0.40   0.14   0.69
Jaro-Winkler       0.44   0.45   0.14   0.69
Levenshtein        0.16   0.21   0.18   0.54
Monge-Elkan        0.22   0.32   0.12   0.65
Needleman-Wunsch   0.16   0.21   0.18   0.54
Smith-Waterman     0.18   0.16   0.09   0.34
TFIDF              0.51   0.55   0.25   0.69
Soft TFIDF         0.51   0.55   0.25   0.69
LACP               0.62   0.51   0.12   0.84



Table 5 Maximum F1

Method             D1     D2     D3     D4
Jaccard            0.33   0.38   0.37   0.59
Jaro               0.33   0.49   0.28   0.77
Jaro-Winkler       0.56   0.57   0.28   0.77
Levenshtein        0.21   0.28   0.33   0.65
Monge-Elkan        0.24   0.37   0.26   0.67
Needleman-Wunsch   0.21   0.28   0.33   0.65
Smith-Waterman     0.21   0.22   0.18   0.38
TFIDF              0.49   0.58   0.40   0.70
Soft TFIDF         0.49   0.58   0.40   0.71
LACP               0.69   0.67   0.27   0.92



Note: The best values for each column are formatted in bold italics. 



LACP achieves the highest average precision for datasets
D1 and D4 (Table 4) and the best values of Maximum
F1 for D1, D2, and D4 (Table 5). TFIDF and Soft TFIDF
achieve the best scores of average precision for D2 and D3
and the largest Maximum F1 for D3. It is worth noting that
TFIDF and Soft TFIDF demonstrate the same values of
average precision and virtually the same values of Maximum
F1 on each dataset, although Soft TFIDF runs significantly
slower.

Table 6 shows that LACP is the fastest method on
every dataset. Figure 2 depicts four precision-recall
charts plotting interpolated precision values at 11 recall
levels [27]. The horizontal axis shows the 11 recall
points; the vertical axis displays interpolated precision
values. A method with a larger area under its curve
demonstrates a better result. The differences in
performance between LACP, TFIDF and Soft TFIDF are readily
apparent. For D1 and D4, LACP consistently outperforms
the other two methods. It is important to note, however,
that on D2, LACP experiences a rapid precision drop
after recall = 0.5, and that on D3, LACP is inferior to
most methods.
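
The 11-point interpolated precision plotted in such charts can be computed as sketched below. The sketch follows the standard information retrieval definition, in which the interpolated precision at a recall level is the highest precision observed at any recall at or above that level; the function name and the sample ranking are hypothetical.

```python
def eleven_point_interpolated_precision(ranked_relevance):
    """ranked_relevance: booleans for a ranked result list, best-scored item first."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return [0.0] * 11
    points, hits = [], 0  # (recall, precision) after each retrieved item
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += int(is_relevant)
        points.append((hits / total_relevant, hits / k))
    levels = [i / 10 for i in range(11)]  # recall 0.0, 0.1, ..., 1.0
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


# Made-up ranking with relevant items at positions 1 and 3.
curve = eleven_point_interpolated_precision([True, False, True, False])
print([round(value, 2) for value in curve])
```

For this sample the interpolated precision is 1.0 up to recall 0.5 and 2/3 for the higher recall levels.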



Table 6 Execution time in seconds

Method             D1      D2      D3        D4
Jaccard            70      20      568       324
Jaro               105     25      3,637     1,102
Jaro-Winkler       115     26      3,617     1,265
Levenshtein        1,273   301     57,811    16,596
Monge-Elkan        6,240   1,340   258,502   77,555
Needleman-Wunsch   1,294   258     57,982    15,918
Smith-Waterman     1,444   293     58,753    17,519
TFIDF              132     37      928       558
Soft TFIDF         208     144     186,937   11,983
LACP               40      77      202       233



Note: The best values for each column are formatted in bold italics. 



Figure 2 Precision-recall curves of the evaluated methods. Figure 2 depicts four precision-recall charts, one per dataset, plotting interpolated precision values
at 11 recall levels. The horizontal axis shows the 11 recall points; the vertical axis displays interpolated precision values. A method with a larger area
under its curve demonstrates a better result. The differences in performance between LACP, TFIDF and Soft TFIDF are readily apparent. For D1 and
D4, LACP consistently outperforms the other two methods. It is important to note, however, that on D2, LACP experiences a rapid precision drop
after recall = 0.5, and that on D3, LACP is inferior to most methods.



Discussion 

The primary advantage of the LACP method is its short
execution time, a feature that is highly desirable when
dealing with the large datasets involved in medical
informatics. The performance of the LACP method can
be interpreted by studying the structure of datasets
D1 through D4. Datasets D1, D2, and D4 have higher numbers
of terms per concept than dataset D3 (see Table 1).
Thus, D1, D2, and D4 have a higher number of records
that have the same CUIs and have approximately common


Table 7 Example of similar terms with different concept IDs from dataset D3

CUI       Term
C0602912  Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 4-chloro-N(1)-methyl-N(1)-((tetrahydro-2-methyl-2-furanyl)methyl)-1,3-benzenedisulfonamide and 3-hydroxy-alpha-methyl-L-tyrosine
C0053099  Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 4-chloro-N(1)-methyl-N(1)-((tetrahydro-2-methyl-2-furanyl)methyl)-1,3-benzenedisulfonamide and myo-inositol hexa-3-pyridinecarboxylate
C0050737  Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 6-chloro-3,4-dihydro-2H-1,2,4-benzothiadiazine-7-sulfonamide 1,1-dioxide and 1(2H)-phthalazinone hydrazine
C0600796  Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 6-chloro-3,4-dihydro-2H-1,2,4-benzothiadiazine-7-sulfonamide 1,1-dioxide and 5-ethyl-5-(1-methylpropyl)-2,4,6(1H,3H,5H)-pyrimidinetrione monosodium salt
C0602088  Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 6-chloro-3,4-dihydro-2H-1,2,4-benzothiadiazine-7-sulfonamide 1,1-dioxide, 1(2H)-phthalazinone hydrazone and potassium chloride (KCl)

prefixes. This allows the LACP algorithm to outperform
other, more complicated well-known methods on D1, D2,
and D4.

However, the LACP method performed poorly on D3.
This is due to the large number of concepts with similar
terms. For example, the five terms shown in Table 7 share
a 146-character-long common prefix. By design, LACP
evaluates such terms as very similar, which is incorrect
because they belong to different concepts. Large numbers of
such similarly spelled UMLS terms with different identifiers
leave no chance for the LACP algorithm to succeed in these
contexts.

We note that the current online spell checker is a
prototype. It has not been optimized for speed, nor is it
intended to compete with the well-known Google Instant
Search [10], which displays search predictions as
the user types a query. Instead, our goal is to create a
spell checker specifically for use with biomedical
terminologies. The remarkable difference between the
excellent performance of LACP on datasets D1, D2, and D4
and its disappointing performance on D3 indicates that
approximate string matching methods exhibit a certain degree
of domain dependence. In fact, as detailed in an extensive
research report by Rudniy [41], domain dependence
has been shown to be a common phenomenon.

Conclusions 

LACP is a novel method we have developed for computing
approximate string similarities based on assessing
the length of approximately common string prefixes.
The algorithm implements a normalization technique by
dividing the length of the approximately common prefix
by the average length of the pair of strings. LACP
performed better than a number of well-known string
similarity algorithms on three out of four datasets and
demonstrated the shortest execution times on all four.
For the average precision measure, LACP achieved the
highest values of 0.62 on dataset D1 and 0.84 on dataset
D4. On D2, LACP was second best, with an average
precision of 0.51. Our method had the best values of
Maximum F1 on three datasets: 0.69 on D1, 0.61 on D2, and
0.92 on D4. However, LACP experienced a drop in
performance on dataset D3. In terms of execution time,
LACP was on average two times faster than the Jaccard
method, which achieved the second best times.

The LACP method demonstrated superior performance
on certain types of biomedical datasets, though its
performance on other corpora remains to be determined.
Another limitation, common to approximate string
matching methods, is the inability to determine that
differently spelled synonyms correspond to the same
concept. For such cases, either semantic methods or
expert insight is required.

In future work, we will attempt to identify the cause
of, and solve the problem of, performance variability due
to differences in dataset characteristics. Another branch
of future research consists of investigating the best value
for parameter a. The ultimate, though difficult, goal is
to develop an approximate string matching method that
recognizes and adapts to the distinctive characteristics of
each dataset.